ABSTRACT


Markov Decision Processes deal with sequential decision
making in stochastic systems. Existing solution techniques
provide powerful tools for determining the optimal policy set
in such systems; however, many practical problems have
extremely large state and action spaces, making them
computationally intractable. Typically, the state variable
definition is n-dimensional and the number of states grows
as the n-th power of the number of levels per dimension. For
such large problems, the need for large amounts of random
access memory and computation time restricts the ability to
obtain solutions. The purpose of this paper is both to present
a methodology which facilitates the solution of large scale
problems and to provide computational results indicating the
value of the approach.


INTRODUCTION 


An undiscounted, infinite horizon, discrete time Markov
Decision Process problem can be described as follows: A
system has N states, and for each state i the decision maker
can select any alternative from the set $K_i$. If the system is
in state i and alternative k is selected, the system will make
a transition to state j, j = 1, ..., N, according to the given
transition probability vector
$[p_{i1}^k, p_{i2}^k, \ldots, p_{iN}^k]$, and earn a reward
(cost) $r_{ij}^k$.

A policy is defined as a collection of selected
alternatives for all states. $v_i(n)$ denotes the total expected
earnings over the next n transitions if the system is now at
state i and the optimal policy is followed.

The  Dynamic  Programming  formulation  developed  by  Howard 
[5]  for  the  finite  horizon  problem  gives  the  following 
recursive  equations  : 

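$$v_i(n+1) = \max_{k \in K_i} \Big\{ \sum_{j=1}^{N} p_{ij}^k \big( r_{ij}^k + v_j(n) \big) \Big\}
           = \max_{k \in K_i} \Big\{ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n) \Big\}, \quad i = 1, \ldots, N,$$

where $q_i^k = \sum_{j=1}^{N} p_{ij}^k r_{ij}^k$.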
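
As a minimal sketch of one step of this recursion (not the authors' code), the following assumes every alternative is available in every state and stores the data as dense arrays `P[k, i, j]` and `q[k, i]`. The function name and the two-state numbers are illustrative assumptions, not data from this paper.

```python
import numpy as np

def value_iteration_step(P, q, v):
    """One step of v_i(n+1) = max_k [ q_i^k + sum_j p_ij^k v_j(n) ].

    P has shape (K, N, N) with P[k, i, j] = p_ij^k,
    q has shape (K, N) with q[k, i] = q_i^k, v has shape (N,).
    """
    test = q + P @ v                      # test quantities, shape (K, N)
    return test.max(axis=0), test.argmax(axis=0)

# Illustrative two-state, two-alternative data (assumed for this sketch).
P = np.array([[[0.5, 0.5], [0.4, 0.6]],
              [[0.8, 0.2], [0.7, 0.3]]])
q = np.array([[6.0, -3.0],
              [4.0, -5.0]])

v = np.zeros(2)
history = []
for n in range(2):
    v, policy = value_iteration_step(P, q, v)
    history.append((v.copy(), policy.copy()))
```

With v(0) = 0 the first step simply selects the largest immediate reward in each state; from the second step on the expected future earnings enter the comparison.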


Since the infinite horizon process is allowed to make
infinitely many transitions, the total expected earnings are
eventually infinite. The goal of infinite horizon Markov
Decision Process optimization is thus to find the policy that
maximizes (minimizes) the expected gain (g) per transition,

$$g = \sum_{i=1}^{N} \pi_i q_i,$$

where $\pi_i$ is the limiting state probability for state i.

Howard's Policy Iteration method [5] can be applied to
infinite horizon problems. In this procedure, relative
values $v_i$ are determined for a given policy by solving a
set of N simultaneous linear equations. These relative values
are used in a policy improvement procedure to find a policy
with a higher (lower) gain. The new policy is used to
determine a new set of relative values $v_i$. This process is
repeated until no better policy can be found. The
disadvantage of Howard's Policy Iteration for large scale
problems is the computational effort required to solve the N
simultaneous linear equations.

A successive approximation approach for solving the N
equations can be shown empirically to be computationally more
efficient. The successive approximation method can be stated
as follows:

If all transition probability matrices (of all possible
policies) are single chained, and all those associated with
maximal gain policies are aperiodic, the following recursive
computation will eventually converge to the optimal policy
[12]:

$$y_i(n+1) = \max_{k \in K_i} \Big\{ q_i^k + \sum_{j=1}^{N} p_{ij}^k w_j(n) \Big\},$$

$$w_i(n+1) = y_i(n+1) - y_N(n+1),$$

starting with $w_i(0) = y_i(0) - y_N(0)$.

It can also be shown that $y_N(n)$ converges to the optimal
gain and the $w_i(n)$ converge to the relative values as in
Howard's method.
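
The recursion above can be sketched as follows. This is an illustrative implementation under the assumptions that all alternatives are available in every state, that state N (the last index) is the reference state, and that the iteration starts from w(0) = 0; the function name and the data are hypothetical.

```python
import numpy as np

def relative_value_iteration(P, q, n_iter=200):
    """Successive approximation with relative values.

    y_i(n+1) = max_k [ q_i^k + sum_j p_ij^k w_j(n) ]
    w_i(n+1) = y_i(n+1) - y_N(n+1), with the last state as reference.
    Returns (estimate of the gain, relative values, policy).
    """
    n_states = P.shape[1]
    w = np.zeros(n_states)
    for _ in range(n_iter):
        test = q + P @ w                  # shape (K, N)
        y = test.max(axis=0)
        policy = test.argmax(axis=0)
        w = y - y[-1]
    return y[-1], w, policy

# Illustrative two-state, two-alternative data (assumed for this sketch).
P = np.array([[[0.5, 0.5], [0.4, 0.6]],
              [[0.8, 0.2], [0.7, 0.3]]])
q = np.array([[6.0, -3.0],
              [4.0, -5.0]])

gain, w, policy = relative_value_iteration(P, q)
```

Subtracting $y_N(n+1)$ at every step keeps the iterates bounded even though the total expected earnings grow without limit.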

Morton [9] showed that a fixed policy successive
approximation guarantees the convergence of the relative
values (i.e., the $w_i(n)$) in on the order of $1/(1-\beta)$
iterations, where $\beta$ is the second largest eigenvalue of
the transition matrix of this fixed policy. He suggested a
method similar to Howard's Policy Iteration procedure, except
that the relative values are computed by the fixed policy
successive approximation. In this paper, we modify Morton's
approach in order to gain computational efficiency.

OTHER  RELEVANT  LITERATURE 

We will limit discussion to the undiscounted, discrete
time Markov Decision Process problem. Bellman [1] first
proposed the Markov Decision Process problem. Howard [5]
presented the Policy Iteration approach, which later was
generalized to other classes of Markov Decision Processes.
His work also gave a detailed discussion of the modeling
concepts of Markov Decision Processes.

White  [17]  first  proved  the  asymptotic  behavior  of 
successive  approximations  under  the  condition  that  there 
exists  a  path  of  u  step  transitions  (for  all  possible 
policies)  connected  to  some  state  from  all  other  states, 
where  u  is  a  positive  integer.  Schweitzer  [12]  later  showed 
that  if  all  transition  probability  matrices  are  single 
chained  and  aperiodic,  then  the  successive  approximation  for 
Howard's Dynamic Programming approach [5] results in the
asymptotic convergence of the total expected earnings
($v_i(n)$) to a linear function of n.

In addition, alternative Linear Programming formulations
were presented by Manne [7], Wolfe and Dantzig [18], and
Wagner [16]. It was shown that the optimal solutions are
always pure policies. It might also be noted that for a
Markov Decision Process, Howard's Policy Iteration is
equivalent to the block pivoting approach of Linear
Programming.

Odoni  [10]  presented  upper  and  lower  bounds  for  the  gain 
and  relative  state  costs  using  the  successive  approximation 
approach,  thus  establishing  rational  stopping  criteria  for 
the  procedure.  A  more  general  condition  of  the  asymptotic 
behavior  of  successive  approximation  for  the  multichain  case 
was  given  by  Schweitzer  and  Federgruen  [13]. 

The generalizations of Policy Iteration and successive
approximation to periodic Markov Decision Processes have been
presented by Peterson [11] and Su and Deininger [15]. Zaldivar
and Hodgson [19] proposed an extrapolation procedure for
speeding convergence of the relative values. Hodgson and
Koehler [4] investigated scaling techniques to speed
convergence in Markov, semi-Markov and continuous parameter
Decision Processes. Schweitzer and Seidmann [14] suggested a
method of polynomial approximations for the relative values,
noticing that the relative values could be fit accurately
by polynomials on the state space.


OBSERVATIONS 


Three observations relative to real-world large scale
Markov Decision Process problems can be made.

(1) We have examined many production-inventory problems
formulated as Markov Decision Processes. Typically, a large
portion of the states are transient for an optimal policy set.
Since the recurrent states form a closed communicating class,
under the fixed policy successive approximation the
alternatives at transient states do not influence the
calculation of the relative values of the recurrent states or
the gain of the system.

(2)  The  recurrent  states  of  the  optimal  policy  tend  to 
cluster  in  a  small  number  (possibly  1)  of  compact  groups. 

(3) The state space is vector valued, i.e.,
$i = (i_1, i_2, \ldots, i_s)$, and the alternatives can also be
described as vector valued, i.e., $k = (k_1, k_2, \ldots, k_m)$.
Note that the dimension (s) of the state variable could be
different from the dimension (m) of the alternative vector. An
interpretation of the state variable i might be the amount of
inventory of product at each stage of production. An
interpretation of the alternative vector k might be the
amount of product ordered at each stage of production. It
should be noted that in the following, the state designations i
and j refer, when appropriate, to the vector state
designation and k refers to the vector alternative
designation.

BASIC  CONCEPTS  FOR  THE  PROCEDURE 

(1)  I-step  transient  states 

For a given transition matrix, let R denote the set
of recurrent states and T denote the set of transient states.
We know that if $i \in R$ and $j \in T$, then $p_{ij} = 0$, i.e.,
there is no transition from a recurrent state to a transient
state.

Note that the fixed policy successive approximation
is as follows:

$$y_i(n+1) = q_i + \sum_{j=1}^{N} p_{ij} w_j(n),$$
$$w_i(n+1) = y_i(n+1) - y_N(n+1), \quad i = 1, \ldots, N. \qquad (1)$$

For a state $i \in R$, $p_{ij} > 0$ only for $j \in R$. Thus it is
possible to ignore the computation of $w_j(n)$ and $y_j(n)$ for
transient states and limit the range of the summation in (1)
to the recurrent states without altering the result of the
computation for the recurrent states.

In addition, many of the transient states can be
classified as "one-step transient states", "I-step
transient states", etc. A state is said to be an "I-step
transient state" if, upon starting in this state, the system
will reside in a recurrent state after exactly I transitions
with probability 1. Clearly, within one step, a two-step
transient state can only reach recurrent states and one-step
transient states.

In computing equation (1), if i is a one-step transient
state, the relative value $w_i$ can be calculated exactly with
one iteration after convergence of the relative values of the
recurrent states is achieved. As for a two-step transient
state, it is necessary only to calculate its relative value
for two iterations after the convergence of the recurrent
states is achieved.
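
The idea can be sketched on a three-state chain in which state 0 is one-step transient and states 1 and 2 are recurrent. The helper name, the choice of state 2 as the reference state, and the numbers are assumptions made for illustration only.

```python
import numpy as np

def fixed_policy_sa(P, q, states, ref, n_iter=50):
    """Equation (1) restricted to `states`, with `ref` as the reference state.

    Returns ({state: relative value}, estimate of the gain).
    """
    idx = list(states)
    P_sub = P[np.ix_(idx, idx)]           # transitions inside the subset only
    q_sub = q[idx]
    r = idx.index(ref)
    w = np.zeros(len(idx))
    gain = 0.0
    for _ in range(n_iter):
        y = q_sub + P_sub @ w
        gain = y[r]
        w = y - gain
    return dict(zip(idx, w)), gain

# Illustrative fixed policy: state 0 moves to {1, 2} in one step.
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.8, 0.2],
              [0.0, 0.7, 0.3]])
q = np.array([1.0, 4.0, -5.0])

# Iterate on the recurrent states only.
w, gain = fixed_policy_sa(P, q, states=[1, 2], ref=2)

# One extra evaluation of (1) gives the one-step transient state exactly.
y0 = q[0] + sum(P[0, j] * w[j] for j in w)
w[0] = y0 - gain
```

The transient state never enters the iteration; its relative value follows from a single substitution once the recurrent values have settled.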

(2)  Neighboring  states 

The neighboring states to state i are defined to be
those states that are within a radius r of the state i in the
s-dimensional state space. For example, a two dimensional
state $(i_1, i_2)$ has the following neighboring states within a
radius of 1: $(i_1+1, i_2)$, $(i_1, i_2+1)$, $(i_1-1, i_2)$,
$(i_1, i_2-1)$.

It should be noted that within the concept of Markov
Processes, it is useful to think of the neighborhood of a set
of states (e.g., the set of recurrent states). Since there
would normally be considerable overlap in the identification
of neighboring states, the neighborhood of the recurrent set
of a Markov Process might contain fewer states than are in
the recurrent set itself.

The concept of neighboring states will help restrict
computation to a limited set of states. Consequently, it will
limit both the amount of computation and active memory
required.
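
A minimal sketch of this definition, assuming an integer lattice, Euclidean distance, and a state space bounded below by 0 and above by a per-dimension maximum; the function names and the bounds are illustrative assumptions.

```python
import math
from itertools import product

def neighbor_offsets(dim, radius):
    """Nonzero integer offsets whose Euclidean length is at most `radius`."""
    reach = math.floor(radius + 1e-9)     # largest offset along one axis
    limit = radius * radius + 1e-9        # tolerance for radii such as sqrt(2)
    return [d for d in product(range(-reach, reach + 1), repeat=dim)
            if 0 < sum(x * x for x in d) <= limit]

def neighborhood(states, radius, upper):
    """Neighbors of a set of states that lie in the box [0, upper] and
    are not themselves members of the set."""
    states = set(states)
    found = set()
    for s in states:
        for d in neighbor_offsets(len(upper), radius):
            t = tuple(a + b for a, b in zip(s, d))
            if all(0 <= x <= u for x, u in zip(t, upper)):
                found.add(t)
    return found - states

counts_2d = [len(neighbor_offsets(2, r))
             for r in (1, math.sqrt(2), 2, math.sqrt(5))]
counts_3d = [len(neighbor_offsets(3, r))
             for r in (1, math.sqrt(2), math.sqrt(3), 2, math.sqrt(5))]
corner = neighborhood({(0, 0)}, 1, upper=(10, 10))
```

Because neighborhoods of adjacent states overlap, the neighborhood of a compact cluster grows much more slowly than the number of states in the cluster times the count per state.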

A  PROCEDURE 

We  now  state  the  procedure.  The  conditions  under  which 
the  procedure  is  guaranteed  to  achieve  the  optimal  policy  set 
will  be  given  in  the  next  section. 

1.  For  a  given  policy,  compute  the  limiting  state 
probabilities  in  order  to  determine  the  set  of  recurrent 
states,  R. 

2.  Find  neighboring  states  within  some  radius  r  for  all 
recurrent  states.  Let  the  set  of  all  the  neighboring  states 
not  in  R  be  A. 

3.  Find  all  the  states  that  are  reachable  (in  one  or 
more  transitions)  from  A  but  not  including  the  states  in  A  or 
R.  Let  the  set  of  these  states  be  C. 

4. Implement the fixed policy successive approximation
for the states in the set R+A+C for a fixed number of
iterations.

5. Making use of the relative values $w_i$ calculated in
step 4 for states in R+A+C, implement the policy improvement:

$$\max_{k \in K_i'} \Big\{ q_i^k + \sum_{j} p_{ij}^k w_j(n) \Big\}, \quad i, j \in R+A+C,$$

where $K_i'$ denotes the set of alternatives for state
i that can only make transitions to states in R+A+C.

6. If there is no change in the policy set and the
$w_i(n)$ have "converged", stop. Otherwise, go to step 1.

The  purpose  of  determining  set  C  in  step  3  is  to  find 
all  the  states  that  are  in  the  paths  from  neighboring  states 
to  recurrent  states.  This  is  relatively  easy  computationally 
because  of  the  structure  of  the  state  space  in  real  problems, 
i.e., a "neighboring" state should be very "close" to the
recurrent  states  in  the  state  space.  Computational  experience 
has  shown  that  C  is  typically  a  very  small  set. 
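
Steps 2 and 3 can be sketched as a neighborhood lookup followed by a breadth-first search. The callables `neighbors` and `successors` and the one-dimensional example are hypothetical stand-ins for a real model's state space and fixed-policy transitions.

```python
from collections import deque

def build_A_and_C(R, neighbors, successors):
    """A: neighbors of recurrent states that are not in R.
    C: states reachable from A (one or more transitions) not in A or R."""
    R = set(R)
    A = {n for s in R for n in neighbors(s)} - R
    C = set()
    queue = deque(A)
    while queue:
        s = queue.popleft()
        for t in successors(s):
            if t not in R and t not in A and t not in C:
                C.add(t)
                queue.append(t)
    return A, C

# Illustrative one-dimensional chain on states 0..10 under a fixed policy.
def neighbors(s):
    return [t for t in (s - 1, s + 1) if 0 <= t <= 10]

def successors(s):
    if s == 4:
        return [5]
    if s == 5:
        return [4]
    return [s + 2] if s < 4 else [s - 4]

A, C = build_A_and_C({4, 5}, neighbors, successors)
```

Here state 6 is a neighbor of the recurrent set but passes through state 2 on its way back, so state 2 is the only member of C.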

In step 4, the fixed policy successive approximation is
restricted to the set R+A+C. Since all the states in A+C have
a path to R (note that many states in A+C are typically
I-step transient with small I), fixed policy successive
approximation guarantees the convergence of the relative
values.

Step 5 restricts the policy improvement to the set of
alternatives that communicate within R+A+C. If an alternative
communicates with a state j outside R+A+C, it is not
possible to compare this alternative with other alternatives
because the relative value ($w_j(n)$) is unknown. Note that the
new policy generated by this policy improvement procedure
restricts all states in R+A+C to communicate only with the
states in R+A+C. In other words, the new recurrent chain, R',
contains only states in R+A+C. With this new recurrent chain,
we have a new set R'+A'+C' which could be used to find new
relative values and, consequently, a new policy set.
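
The restricted improvement of step 5 might look like the sketch below for a maximization problem. The data layout (a list of `(q, {next_state: probability})` pairs per state) and all numbers are assumptions for illustration.

```python
def restricted_improvement(active, alternatives, w):
    """Pick, for each active state, the best alternative among those whose
    transitions stay inside the active set (R+A+C).

    alternatives[i] is a list of (q_ik, {j: p_ijk}) pairs;
    w maps each active state to its relative value.
    """
    policy = {}
    for i in active:
        best = None
        for k, (q_ik, row) in enumerate(alternatives[i]):
            if any(j not in active for j in row):
                continue                  # w_j is unknown outside the set
            value = q_ik + sum(p * w[j] for j, p in row.items())
            if best is None or value > best[0]:
                best = (value, k)
        if best is not None:
            policy[i] = best[1]
    return policy

active = {0, 1}
w = {0: 10.0, 1: 0.0}
alternatives = {
    0: [(6.0, {0: 0.5, 1: 0.5}),
        (4.0, {0: 0.8, 1: 0.2}),
        (100.0, {2: 1.0})],               # leaves the active set: skipped
    1: [(-3.0, {0: 0.4, 1: 0.6}),
        (-5.0, {0: 0.7, 1: 0.3})],
}
policy = restricted_improvement(active, alternatives, w)
```

The third alternative of state 0 has the largest immediate reward but is never compared, because the relative value of state 2 has not been computed.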

Notice that step 1 is computationally efficient.
Initially a single state is assigned a state probability of 1,
and then the state probabilities are computed recursively by
using $\pi(n+1) = \pi(n)P$, where $\pi(n)$ is the state
probability vector after n transitions and P is the
transition probability matrix. In real Markov Decision
problems $\pi(n)$ is usually a sparse vector, so the actual
computations at each iteration include only those states that
have positive entries in $\pi(n)$ and the states reachable
from them. Given that the underlying Markov process is
single-chained, $\pi(n)$ will eventually converge to the
limiting state probabilities and thus identify the set of
recurrent states. In the actual computations, we would also
restrict the number of iterations to a predetermined number
(say, 10). Even though some states that are actually recurrent
may not be included in the recurrent set within the given
number of iterations, if there is at least one recurrent state
in this "incomplete" recurrent set, step 3 of the procedure
will find all the other recurrent states.
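
A sparse version of this propagation can be sketched with dictionaries, so that only states carrying probability mass are touched. The transition data, the threshold used to read off the estimated recurrent set, and the function name are illustrative assumptions.

```python
def propagate(transitions, start, n_iter=10):
    """Sparse pi(n+1) = pi(n) P, starting with all mass on `start`.

    transitions[s] is a dict {next_state: probability} under the fixed policy.
    """
    pi = {start: 1.0}
    for _ in range(n_iter):
        nxt = {}
        for s, mass in pi.items():
            for t, p in transitions[s].items():
                nxt[t] = nxt.get(t, 0.0) + mass * p
        pi = nxt
    return pi

# Illustrative chain: state 0 is transient, states 1 and 2 are recurrent.
transitions = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.8, 2: 0.2},
    2: {1: 0.7, 2: 0.3},
}
pi = propagate(transitions, start=0)
R_estimate = {s for s, mass in pi.items() if mass > 1e-6}
```

After ten transitions the mass on the transient state has vanished and the remaining entries are close to the limiting probabilities.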

The computational savings of the procedure come directly
from the fact that only a subset of the states (R+A+C) is
actually involved in the computations at each iteration.

CONDITIONS FOR CONVERGENCE OF THE PROCEDURE

First,  the  definition  of  a  concave  (convex)  function  in 
the  discrete  space  is  introduced. 

Definition. Let $\Omega$ be an m-dimensional discrete state
space and $(k_1, k_2, \ldots, k_m) \in \Omega$, where the $k_j$ are
integers. A function $f(k_1, k_2, \ldots, k_m)$ is said to be
concave (convex) if for any vectors $k^a, k^b, k^c \in \Omega$,
where

$$k^c = \lambda k^a + (1-\lambda) k^b \quad \text{for some } \lambda \in (0,1),$$

we have

$$\lambda f(k^a) + (1-\lambda) f(k^b) \le f(k^c)$$

$$\big( \lambda f(k^a) + (1-\lambda) f(k^b) \ge f(k^c) \big).$$

In general, assume that there exists an s-dimensional
state variable $(i_1, i_2, \ldots, i_s)$ and an m-dimensional
alternative variable $(k_1, k_2, \ldots, k_m)$ at each state.
Assume also that the objective is to maximize the gain of the
system. The procedure will find the optimal solution if the
system satisfies the following two conditions:

1. Consider an alternative variable $(k_1, k_2, \ldots, k_m)$
that brings the system to a set of states, say S. The
neighborhood chosen to implement the procedure must satisfy
the following: If we increase (or decrease) any entry of this
alternative variable by 1 unit, for example,
$(k_1, k_2, \ldots, k_i + 1, \ldots, k_m)$, the new alternative
will bring the system in one transition to a set of states
that is included in the states in S plus the neighborhood of S.

2. For each state i, the test quantity

$$q_i(k_1, k_2, \ldots, k_m) + \sum_{j} p_{ij}(k_1, k_2, \ldots, k_m)\, w_j$$

is a concave function in $(k_1, k_2, \ldots, k_m)$. (The test
quantity needs to be convex if the objective is to minimize
the gain.)

If the above two conditions are true, then whenever
the current policy is not optimal for a given state in the
current set R, there must exist some better alternative
(i.e., one with a larger (smaller) test quantity) that keeps
the system within the current R+A+C set. This alternative
would be found by the modified policy iteration. This implies
that under conditions 1 and 2, if the policy set for states in
R is not optimal, then a new (better) policy set can be found.
This procedure can be repeated, eventually terminating with
the optimal policy set. The problem is that while many models
satisfy condition 1, few models will fully satisfy condition
2. However, in the following section, it will be seen that,
empirically at least, this approach to computation is both
robust and extremely efficient.

A  TEST  PROBLEM 

In  this  section  a  test  problem  is  described.  The  problem 
does  not  fully  satisfy  condition  2.  However,  we  will  see  in 
the  following  section  (COMPUTATIONAL  RESULTS)  the  robustness 
of  the  procedure  in  solving  the  problem.  A  multistage 
production-inventory  system  has  been  used  as  a  test  problem 
for  our  methodology.  Clark  and  Scarf  [3]  formulated  this 
problem  and  proposed  a  heuristic  algorithm  to  find  control 
rules for the system. Lambrecht, Luyten and Muckstadt [6]
formulated it as a Markov Decision Process and offered an
interesting computational comparison of these two approaches.

Consider  a  two-stage  production-inventory  system  which 
operates  on  a  period  to  period  basis.  (See  FIGURE  1  for  a 
schematic  of  the  system.)  At  the  beginning  of  each  period, 
production at each stage must be determined. There is an
in-process and a finished product inventory. The units of
inventory and production are integers. There is a one period
delay  for  production  at  the  second  stage.  The  system  incurs 
set  up  and  variable  costs  of  production,  inventory  costs  and 
shortage  costs.  The  maximum  inventory  level  at  each  facility 
is  restricted.  The  demand  for  a  period  is  expressed  as  a 
probability  mass  function.  This  is  an  infinite  horizon, 
undiscounted,  discrete  time  Markov  Decision  Process  problem. 
The  objective  is  to  minimize  the  total  cost  rate  of  the 
system. 

The following terminology is useful in describing the
model:

1. State variable $(i_1, i_2)$: $i_1$ is a nonnegative integer
representing the on-hand inventory level of stage 1; $i_2$
is a nonnegative integer representing the on-hand plus
on-order inventory of stage 2. Let $i = (i_1, i_2)$.

2. Alternative variable $(k_1, k_2)$: $k_1$ is a nonnegative
integer representing the number of units to be produced at
stage 1; $k_2$ is a nonnegative integer representing the number
of units to be produced at stage 2. Let $k = (k_1, k_2)$.

3. Demand D: a random variable which can take on the
values 0, 1, 2, ..., depicting the units of finished product
ordered according to the probability mass function p(d).

4. Maximum inventory L1 and L2: L1 is the maximum
inventory of stage 1 and L2 is the maximum inventory of stage
2.

[FIGURE 1: schematic of the production-inventory system.
Labels: Raw Material, Stage 1, In-Process Product Inventory,
Finished Product Inventory, Demand]

The state variables are defined to be the integer pairs
$(i_1, i_2)$, with $i_1 \le L1$ and $i_2 \le L2$. If the system is
now at state $(i_1, i_2)$ and alternative $(k_1, k_2)$ is chosen,
the transition is

$$(i_1, i_2) \longrightarrow \big( i_1 + k_1 - k_2,\; (i_2 - D)^+ + k_2 \big), \qquad (2)$$

where $(x)^+ = \max(0, x)$.

To assure that the inventory remains at or below the
maximum inventory level, the sum of on hand and on order
inventory is limited to be less than or equal to L2 (the
maximum inventory level at stage 2), i.e., $k_2 + i_2 \le L2$.
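
Transition (2) translates directly into code. In the sketch below only the constraint $k_2 + i_2 \le L2$ comes from the text; the bounds imposed on the stage 1 inventory, the function names, and the demand distribution are assumptions made for illustration.

```python
def next_state(i, k, d):
    """Transition (2): (i1, i2) -> (i1 + k1 - k2, (i2 - d)^+ + k2)."""
    i1, i2 = i
    k1, k2 = k
    return (i1 + k1 - k2, max(0, i2 - d) + k2)

def transition_row(i, k, pmf):
    """Probability of each successor state, summing over demand values."""
    row = {}
    for d, p in pmf.items():
        j = next_state(i, k, d)
        row[j] = row.get(j, 0.0) + p
    return row

def is_feasible(i, k, L1, L2):
    """k2 + i2 <= L2 is stated in the text; keeping the stage 1
    inventory within [0, L1] is an assumption of this sketch."""
    i1, i2 = i
    k1, k2 = k
    return 0 <= i1 + k1 - k2 <= L1 and i2 + k2 <= L2

pmf = {0: 0.25, 1: 0.50, 2: 0.25}         # illustrative demand distribution
row = transition_row((3, 1), (0, 1), pmf)
```

Different demand values can lead to the same successor once the finished inventory is exhausted, which is why the row is accumulated rather than assigned.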

It can be shown that if the current alternative at some
state is $(k_1, k_2)$ and the state communicates directly with a
set of states, say R, then an increase (or decrease) of one unit
of $k_1$ or $k_2$ will bring the system to a set of states that
is in R+A, where A is the neighboring set of radius
$r = \sqrt{2}$. Notice that this is equivalent to condition 1
of the procedure.

From our computational experience for this model, the
relative costs $w_i$ normally form a convex function in
$(i_1, i_2)$. (A similar observation was given by Schweitzer and
Seidmann [14].) Making use of this observation and the
state transitions described in (2), it can be shown that
$\sum_j p_{ij}^k w_j$ is normally a convex function in
$(k_1, k_2)$.
The cost of a transition given a demand d is

$$r_{ij(d)}^k = HC1 \cdot (i_1 + i_2) + HC2 \cdot i_2
             + VC1 \cdot k_1 + VC2 \cdot k_2
             + \min(1, k_1) \cdot SC1 + \min(1, k_2) \cdot SC2
             + SHC \cdot (i_2 - d)^-,$$

where HC1 and HC2 are the echelon inventory holding
costs of stage 1 and 2, VC1 and VC2 are the variable costs of
stage 1 and 2, SC1 and SC2 are the setup costs of stage 1
and stage 2, SHC is the shortage cost, and
$(x)^- = -\min(0, x)$. Note that j is determined by d.

It is easy to see that $r_{ij(d)}^k$ is a convex function in
$(k_1, k_2)$ for $k_1 \ge 1$ and $k_2 \ge 1$. Thus
$q_i^k = \sum_d p(d)\, r_{ij(d)}^k$ is also a convex function in
$(k_1, k_2)$ for $k_1 \ge 1$ and $k_2 \ge 1$. However, $q_i^k$ is
not convex for $k_1 \ge 0$ and $k_2 \ge 0$ due to the setup
costs, SC1 and SC2.
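
The effect of the setup costs on convexity can be checked numerically with second differences in $k_1$. The cost parameters, state, and demand distribution below are illustrative assumptions rather than values taken from the test data.

```python
import math

def transition_cost(i, k, d, c):
    """Cost of one transition for demand d, with (x)^- = -min(0, x)."""
    i1, i2 = i
    k1, k2 = k
    shortage = -min(0, i2 - d)
    return (c["HC1"] * (i1 + i2) + c["HC2"] * i2
            + c["VC1"] * k1 + c["VC2"] * k2
            + min(1, k1) * c["SC1"] + min(1, k2) * c["SC2"]
            + c["SHC"] * shortage)

def expected_cost(i, k, pmf, c):
    """q_i^k: expectation of the transition cost over the demand."""
    return sum(p * transition_cost(i, k, d, c) for d, p in pmf.items())

# Illustrative parameters (assumed for this sketch).
c = {"HC1": 1, "HC2": 1, "VC1": 10, "VC2": 10,
     "SC1": 40, "SC2": 50, "SHC": 100}
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

f = [expected_cost((3, 1), (k1, 1), pmf, c) for k1 in range(4)]
second_diff_at_1 = f[0] + f[2] - 2 * f[1]   # includes the jump at k1 = 0
second_diff_at_2 = f[1] + f[3] - 2 * f[2]   # purely within k1 >= 1
```

The second difference centred at $k_1 = 1$ equals minus the setup cost, so convexity fails there, while for $k_1 \ge 1$ the cost is linear in $k_1$ and the second difference is zero.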

Notice that the above discussion of the convexity of $q_i^k$
is also true for the test quantity
$q_i^k + \sum_j p_{ij}^k w_j$. We have shown that the test
quantity of this test problem does not "completely" satisfy
the convexity requirement. However, all the test problems were
solved during our computational experiments provided a
sufficiently large radius r was used. The following offers some
insight into the observed robustness of the algorithm for this
set of test problems. Since the recurrent states tend to
cluster in a small number of groups (possibly one), the range
of policy choices for many states tends not to be constrained
by the limit of the neighboring states within the specified
radius, r. This is because the collection of neighboring states
of all recurrent states usually includes the neighboring states
within a radius larger than r of a single recurrent state.

COMPUTATIONAL  RESULTS 

The data examined by Lambrecht, Luyten and Muckstadt [6]
were used as test data. In addition, the formulation was
extended to a three stage problem. All data are given in
TABLE 1. In TABLES 2-7, the CPU time in VAX 11/750 virtual
seconds and the number of policy iterations necessary to
solve each of the problems are given. A total of 36 different
problems were solved. Each was solved conventionally using
all states, and solved using the neighborhood procedure with
radii (r) of 1, $\sqrt{2}$, $\sqrt{3}$, 2, $\sqrt{5}$. Of the
180 solutions using the neighborhood approach, all but 1 were
solved optimally. Ten cheap iterations were used per policy
iteration.

The convergence criteria used were: (1) no policy changes
for any state within the neighborhood (of radius r), and (2)
the sum of the absolute values of the residuals [19] must be
less than a predetermined value. The value was set nominally
at an order of magnitude above the round-off error capability
of the computer for each particular problem. For the 121
state problems (TABLES 2 and 3), the average number of
alternatives per state is 50 and relatively small amounts of
virtual storage are required. For the 1331 state problems
(TABLES 4 and 5) and 10648 state problems (TABLES 6 and 7),
the average numbers of alternatives per state are 147 and 137
respectively, and the requirements for virtual storage are
approximately 2 megabytes and 15 megabytes respectively (the
number of alternatives per state of the 10648 state problems
was limited artificially).

For the 121 state problems (TABLES 2 and 3), the reduction
in computational effort is generally about 67.2%. An
exception is problem 1, which has an unusually large number
of recurrent states. For the 1331 state problems the
reduction averages 93.6%. For the 10648 state problems,
the average reduction is greater than 99.5% in each case.
In almost all cases the optimal solution was achieved. The
exception occurs in problem 5, TABLE 5. This turned out to
be a difficult problem for the procedure. For a radius of 2
the final policy set was non-optimal. However, the gain for
that policy set was within 0.056% of optimal. In addition, for
TABLE  5  problem  6,  solutions  of  radius  Jl  and  rs  were  not 
solved optimally initially. They were solved optimally with
the following simple extension to the procedure: when the
procedure converges, check a larger radius ($\sqrt{5}$ in this
case) on the last policy iteration. If the convergence criteria
are still satisfied, stop.
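
The "Average Computation Reduction" column of the tables is not defined in the text. As a hedged reading, the legible rows of TABLE 2 are consistent with one minus the ratio of the mean neighborhood CPU time to the all-states CPU time; the sketch below checks that interpretation on three rows and should be taken as an inference rather than the authors' stated formula.

```python
def average_reduction(all_states_time, neighborhood_times):
    """Percent reduction of the mean neighborhood time relative to the
    all-states time (an inferred definition, see the note above)."""
    mean_time = sum(neighborhood_times) / len(neighborhood_times)
    return 100.0 * (1.0 - mean_time / all_states_time)

# TABLE 2 rows: (all-states time, times for r = 1, sqrt 2, 2, sqrt 5, printed %)
rows = {
    2: (9.96, [2.45, 2.49, 3.56, 3.66], 69.5),
    3: (5.95, [0.74, 0.83, 1.10, 1.24], 83.6),
    6: (5.55, [0.97, 1.16, 1.27, 1.39], 78.4),
}
computed = {p: round(average_reduction(t, ts), 1)
            for p, (t, ts, _) in rows.items()}
printed = {p: pct for p, (_, _, pct) in rows.items()}
```

For these rows the computed and printed percentages agree to one decimal place.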

It is important to note that implementation of the
procedure is extremely easy. The code was written in FORTRAN
IV with only the simplest of list processing applications
(a few pointers are all that is necessary). By arranging
storage so that variables associated with a particular state
are grouped together, the virtual memory software of the VAX
11/750 (or any other virtual machine) automatically keeps
those segments of memory associated with the recurrent set
plus neighborhood in core. Non-active states are cycled to
the disk. This is particularly important for large problems.

In summary, an approach to solving specially structured
large scale Markov Decision Processes has been presented.
Experimental testing indicates that the computational savings
over conventional (all states) approaches are considerable,
particularly for larger scale problems. Finally, the
procedure is easily programmed and can be readily configured
to take advantage of the natural strategies of virtual memory
computers.


problem*                             |   1  |   2  |   3  |   4  |   5  |   6
-------------------------------------+------+------+------+------+------+------
setup costs           stage 1        |   1  | 440? |   5  |  40  |  10  |   5
                      stage 2        |  50  |   4  |  10  |   4  |   5  |  10
                      stage 3        | 100  |   ?  |   5  |   ?  |  20  |   5
echelon holding cost  stage 1        |  10  |   5  |   7  |   5  |   1  |  1.3
                      stage 2        |   1? |   4  |  15  |   4  |  0.5 |  2.7
                      stage 3        |   1  |   3  |   7  |   3  |   2  |  1.3
variable production   stage 1        |  10  |  50  |   7  |  50  |  10  |   7
cost                  stage 2        |   1  |  40  |  15  |  40  |   5  |  15
                      stage 3        |   1  |  30  |   7  |  30  |  20  |   7
shortage cost         (TABLES 2,4,6) | 100  | 200  | 100  | 200  |  50  | 100
                      (TABLES 3,5,7) | 200  | 400  | 200  | 400  | 100  | 200
demand distribution   d=0            | .25  | .25  | .15  | .25  | .15  | .25
p(d)                  d=1            | .50  | .50  | .20  | .50  | .20  | .50
                      d=2            | .25  | .25  | .30  | .25  | .30  | .25
                      d=3            |      |      | .20  |      | .20  |
                      d=4            |      |      | .15  |      | .15  |

* 11 levels of inventory (0-10) for each stage results in
  121 states for each 2-stage problem (TABLES 2,3) and 1331
  states for each 3-stage problem (TABLES 4,5). 22 levels of
  inventory for each stage results in 10648 states for each
  3-stage problem (TABLES 6,7).

? marks an entry that is illegible or uncertain in the source.


TABLE 1  Test problem input data


(VAX 11/750) CPU time in Sec.

          | all 121 |       |        |       |        | Average
          | states  |  r=1  |  r=√2  |  r=2  |  r=√5  | Computation
          |         |       |        |       |        | Reduction
----------+---------+-------+--------+-------+--------+------------
problem 1 |  9.73   | 6.53  |  6.91  | 7.38  |  7.74  |     ?
problem 2 |  9.96   | 2.45  |  2.49  | 3.56  |  3.66  |   69.5%
problem 3 |  5.95   | 0.74  |  0.83  | 1.10  |  1.24  |   83.6%
problem 4 |  5.53   | 1.10  |  1.16  | 1.28  |  1.46  |   77.4%
problem 5 |  5.68   | 1.10  |  1.26  | 2.40  |  2.26  |     ?
problem 6 |  5.55   | 0.97  |  1.16  | 1.27  |  1.39  |   78.4%

? marks an entry that is illegible in the source.

TABLE 2


(VAX 11/750) CPU time in Sec.

          | all 121 |       |        |       |        | Average
          | states  |  r=1  |  r=√2  |  r=2  |  r=√5  | Computation
          |         |       |        |       |        | Reduction
----------+---------+-------+--------+-------+--------+------------
problem 1 |  9.68   | 6.86  |  7.93  | 9.99  |  9.86  |   10.5%
problem 2 |  9.20   | 1.69  |  1.74  | 2.17  |  2.65  |   77.6%
problem 3 |  5.89   | 0.96  |  1.07  | 1.33  |  1.56  |   79.1%
problem 4 |  5.57   | 1.09  |  1.21  |   ?   |   ?    |     ?
problem 5 |  6.99   | 0.95  |  1.09  | 1.57  |  2.28  |   78.9%
problem 6 |   ?     |   ?   |   ?    |   ?   |   ?    |   35.8%

? marks an entry that is illegible in the source.

Note: r=1:  each state has 4 neighboring states
      r=√2: each state has 8 neighboring states
      r=2:  each state has 12 neighboring states
      r=√5: each state has 20 neighboring states


TABLE 3


(VAX  il/750)  CPU  time  in  Sec. 


r=/5 


problem  1|  636.27 


8172.70 

i 

i 


9. 


5.28  9.8 


2 


.90J22 

i 

i 


15.42 


8.52  9.76  12.08 


problem  5 


problem  6 


606.74ll2.40ill.22J  14.881  14.531  25.47 

lit  i  i 

iii  i  i 


31.23 


TABLE  4 

(VAX  11/750)  CPU  time  in  Sec. 


Average 

Computation 

Reduction 


97.9  % 


98.9% 


87.9% 


99.6% 


97.4% 


          | all 1331 |       |        |        |        |        | Average
          | states   |  r=1  |  r=√2  |  r=√3  |  r=2   |  r=√5  | Computation
          |          |       |        |        |        |        | Reduction
----------+----------+-------+--------+--------+--------+--------+------------
problem 1 | 1358.61  | 13.88 |   ?    | 18.86  | 21.73  | 31.80  |   98.?%
problem 2 |  997.62  | 28.62 | 49.13  | 55.04  | 53.10  | 96.66  |   94.?%
problem 3 |  961.36  | 18.00 | 22.59  | 24.19  | 24.83  | 28.31  |   97.5%
problem 4 |  939.59  | 10.86 | 12.14  | 13.09  | 14.17  | 14.39  |   98.?%
problem 5 |  446.93  | 89.81 | 226.24 | 263.95 | 86.25  | 125.65 |   64.?%
problem 6 |  842.57  | 21.08 |   ?    | 54.35  | 31.04  | 33.40  |   95.5%

? marks an entry or digit that is illegible in the source.

Note: r=1:  each state has 6 neighboring states
      r=√2: each state has 18 neighboring states
      r=√3: each state has 26 neighboring states
      r=2:  each state has 32 neighboring states
      r=√5: each state has 56 neighboring states

TABLE 5
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4.14}  5.90}  5.74}  8.95}  10.97}  99.9%  } 


j  problem  5 }  81  37 .63  j  4-3 

i  i  i 

i  i  i 


problem  6}  1 0 7 7 7  .  6 0  [  3 4 . 4 2  }  2 5 . 3 8 


10.98  14.39 


99 . 9% 


.23}  40.19}  75.90}  99.6% 

i  i  i 

i  I  i 


I  problem  2 

i 


problem  3 


problem  4 


7002 . 88 


7276 . 78 


8947.70 


33 . 04 


7.67 


20.87 

29.00 

10.70 

17.00 

24 . 13 

36 . 61 

9741!  9.64 

i 


problem  6}  9 7 54  .  2 2 } 4 0 . 77  }  50 . 43 


12.62 

“40719 


60.39 


99.9% 


99.5% 


Note: r=1:  each state has 6 neighboring states
      r=√2: each state has 18 neighboring states
      r=√3: each state has 26 neighboring states
      r=2:  each state has 32 neighboring states
      r=√5: each state has 56 neighboring states

TABLE 7